
    Graph-Based Weakly-Supervised Methods for Information Extraction & Integration

    The variety and complexity of potentially related data resources available for querying (webpages, databases, data warehouses) has been growing ever more rapidly. There is a growing need to pose integrative queries across multiple such sources, exploiting foreign keys and other means of interlinking data to merge information from diverse sources. This has traditionally been the focus of research within the Information Extraction (IE) and Information Integration (II) communities, with IE focusing on converting unstructured sources into structured ones, and II focusing on providing a unified view of diverse structured data sources. However, most current IE and II methods that could be applied to the problem of integration across sources require large amounts of human supervision, often in the form of annotated data. This need for extensive supervision makes existing methods expensive to deploy and difficult to maintain. In this thesis, we develop techniques that generalize from limited human input via weakly-supervised methods for IE and II. In particular, we argue that graph-based representations of data, and learning over such graphs, can yield effective and scalable methods for large-scale Information Extraction and Integration.

    Within IE, we focus on the problem of assigning semantic classes to entities. First, we develop a context pattern induction method to extend small initial entity lists of various semantic classes. We also demonstrate that features derived from such extended entity lists can significantly improve the performance of state-of-the-art discriminative taggers. The output of pattern-based class-instance extractors is often high-precision but low-recall, which is inadequate for many real-world applications. We use Adsorption, a graph-based label propagation algorithm, to significantly increase the recall of an initial high-precision, low-recall pattern-based extractor by combining evidence from unstructured and structured text corpora. Building on Adsorption, we propose a new label propagation algorithm, Modified Adsorption (MAD), and demonstrate its effectiveness on various real-world datasets. Additionally, we show how class-instance acquisition performance in the graph-based SSL setting can be improved by incorporating additional semantic constraints available in independently developed knowledge bases.

    Within Information Integration, we develop a novel system, Q, which draws ideas from machine learning and databases to help a non-expert user construct data-integrating queries based on keywords (across databases) and interactive feedback on answers. We also present an information need-driven strategy for automatically incorporating new sources and their information in Q. Finally, we demonstrate that Q's learning strategy is highly effective in combining the outputs of "black box" schema matchers and in re-weighting bad alignments, which removes the need for the expensive mediated schemas that most previous systems have required.
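
    As a rough illustration of the label propagation idea behind Adsorption and MAD, the minimal sketch below implements a MAD-style iterative update over a dense adjacency matrix. It is an assumed simplification: the full algorithm derives per-node injection, continuation, and abandonment probabilities from random walks, which are collapsed here into global weights mu1, mu2, and mu3.

        import numpy as np

        def mad_propagate(W, Y_seed, mu1=1.0, mu2=0.01, mu3=0.01, n_iter=20):
            """MAD-style label propagation (a sketch, not the full algorithm).
            W: (n, n) symmetric nonnegative edge weights.
            Y_seed: (n, k) float seed label distributions; the last column is
                a dummy "abandon" label that regularizes uncertain nodes.
            """
            n, k = Y_seed.shape
            is_seed = (Y_seed.sum(axis=1) > 0).astype(float)
            r = np.zeros(k)
            r[-1] = 1.0                                    # prior mass on the dummy label
            M = mu1 * is_seed + mu2 * W.sum(axis=1) + mu3  # per-node normalizer
            Y = Y_seed.copy()
            for _ in range(n_iter):
                # seed injection + neighbor evidence + dummy-label regularization
                Y = (mu1 * is_seed[:, None] * Y_seed
                     + mu2 * (W @ Y)
                     + mu3 * r) / M[:, None]
            return Y

    High-precision pattern-extracted instances would enter as nonzero rows of Y_seed, and unlabeled nodes acquire label distributions from their neighbors, which is how recall increases without additional annotation.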
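
    Purely as an illustration of the kind of feedback-driven reweighting described for Q (the update rule, names, and signatures below are assumptions, not the system's actual code), a multiplicative-weights combination of black-box matcher scores might look like this:

        import numpy as np

        def alignment_score(weights, matcher_scores):
            """Combined confidence for one candidate alignment."""
            return float(weights @ matcher_scores)

        def reweight_matchers(weights, matcher_scores, feedback, beta=0.8):
            """Multiplicative-weights update from user feedback on an answer.
            weights: (n_matchers,) current trust in each matcher.
            matcher_scores: (n_matchers,) each matcher's confidence in the
                alignment that produced the judged answer.
            feedback: +1 if the user accepted the answer, -1 if rejected.
            """
            if feedback < 0:
                # Penalize matchers in proportion to how strongly they backed
                # the alignment behind the rejected answer.
                weights = weights * beta ** matcher_scores
            return weights / weights.sum()        # keep weights normalized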

    Interpretable semantic vectors from a joint model of brain- and text-based meaning

    Vector space models (VSMs) represent word meanings as points in a high-dimensional space. VSMs are typically created from large text corpora, and so represent word semantics as observed in text. We present a new algorithm (JNNSE) that can incorporate a measure of semantics not previously used to create VSMs: brain activation data recorded while people read words. The resulting model takes advantage of the complementary strengths and weaknesses of corpus and brain activation data to give a more complete representation of semantics. Evaluations show that the model 1) matches a behavioral measure of semantics more closely, 2) can be used to predict corpus data for unseen words, and 3) has predictive power that generalizes across brain imaging technologies and across subjects. We believe that the model is thus a more faithful representation of mental vocabularies.
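
    As a hypothetical sketch of the joint factorization idea (a shared non-negative word embedding that reconstructs both a corpus-statistics matrix and a brain-activation matrix), the code below uses projected gradient steps with an l1 penalty; the objective, update rule, and names are illustrative assumptions rather than the paper's implementation:

        import numpy as np

        def joint_embedding(X, B, k=50, lam=0.1, lr=1e-3, n_iter=500, seed=0):
            """JNNSE-flavored joint factorization (illustrative sketch).
            X: (n_words, n_text_feats) corpus statistics.
            B: (n_brain_words, n_voxels) brain activations, aligned with the
               first n_brain_words rows of X.
            Learns a shared nonnegative embedding A plus per-source dictionaries.
            """
            rng = np.random.default_rng(seed)
            n, _ = X.shape
            nb, v = B.shape
            A = rng.random((n, k))
            Dt = rng.random((k, X.shape[1]))
            Db = rng.random((k, v))
            for _ in range(n_iter):
                Rt = A @ Dt - X                   # corpus reconstruction residual
                Rb = A[:nb] @ Db - B              # brain reconstruction residual
                gA = Rt @ Dt.T
                gA[:nb] += Rb @ Db.T              # brain rows get both gradients
                A = np.maximum(A - lr * (gA + lam * np.sign(A)), 0.0)
                Dt = np.maximum(Dt - lr * (A.T @ (A @ Dt - X)), 0.0)
                Db = np.maximum(Db - lr * (A[:nb].T @ (A[:nb] @ Db - B)), 0.0)
            return A, Dt, Db

    Because A is shared, words with brain data pull the embedding toward neural similarity structure, while Dt still ties every word to its corpus statistics; this is one way to realize the complementary strengths the abstract describes.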

    FlexiFaCT: Scalable Flexible Factorization of Coupled Tensors on Hadoop

    Given multiple data sets of relational data that share a number of dimensions, how can we efficiently decompose our data into the latent factors? Factorization of a single matrix or tensor has attracted much attention, as, e.g., in the Netflix challenge, with users rating movies. However, we often have additional side information, e.g., demographic data about the users in the Netflix example above. Incorporating this additional information leads to the coupled factorization problem. So far, it has been solved only for relatively small datasets.

    We provide a distributed, scalable method for decomposing matrices, tensors, and coupled data sets through stochastic gradient descent on a variety of objective functions. We offer the following contributions: (1) Versatility: our algorithm can perform matrix, tensor, and coupled factorization, with flexible objective functions including the Frobenius norm, the Frobenius norm with l1-induced sparsity, and non-negative factorization. (2) Scalability: FlexiFaCT scales to unprecedented sizes in both the data and the model, with up to billions of parameters, and runs on standard Hadoop. (3) Convergence proofs showing that FlexiFaCT converges on this variety of objective functions, even with projections.
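
    As a single-machine illustration of the coupled objective (Frobenius loss only; the actual system shards disjoint blocks of these updates across Hadoop workers, and the entry format and names here are assumptions), stochastic gradient descent over observed cells of a tensor and a coupled matrix might look like:

        import numpy as np

        def coupled_sgd(X_entries, Y_entries, dims, rank=10, lr=0.01,
                        n_epochs=10, seed=0):
            """SGD for coupled matrix-tensor factorization (sketch).
            X_entries: list of ((i, j, k), value) cells of a 3-way tensor
                approximated by outer products of columns A[:, r], B[:, r], C[:, r].
            Y_entries: list of ((i, m), value) cells of a matrix Y ~ A @ D.T
                that shares mode A with the tensor.
            dims: (I, J, K, M) sizes of the four modes.
            """
            I, J, K, M = dims
            rng = np.random.default_rng(seed)
            A, B, C, D = (rng.normal(scale=0.1, size=(d, rank))
                          for d in (I, J, K, M))
            for _ in range(n_epochs):
                for (i, j, k), x in X_entries:    # tensor cells
                    Ai, Bj, Ck = A[i].copy(), B[j].copy(), C[k].copy()
                    err = x - np.sum(Ai * Bj * Ck)
                    A[i] += lr * err * Bj * Ck
                    B[j] += lr * err * Ai * Ck
                    C[k] += lr * err * Ai * Bj
                for (i, m), y in Y_entries:       # coupled matrix cells share A
                    Ai = A[i].copy()
                    err = y - Ai @ D[m]
                    A[i] += lr * err * D[m]
                    D[m] += lr * err * Ai
            return A, B, C, D

    Cells touching disjoint rows of A, B, C, and D produce non-conflicting updates, so blocks of them can be processed in parallel; that independence is the property a Hadoop partitioning can exploit.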